- Oceania > Australia > New South Wales > Sydney (0.14)
- Europe > Poland > Greater Poland Province > Poznań (0.05)
- North America > Canada > Quebec > Montreal (0.05)
- (16 more...)
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval
Du, Yufeng, Tian, Minyang, Ronanki, Srikanth, Rongali, Subendhu, Bodapati, Sravan, Galstyan, Aram, Wells, Azton, Schwartz, Roy, Huerta, Eliu A, Peng, Hao
Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This gap is commonly attributed to retrieval failures: the models' inability to identify relevant information in long inputs. Accordingly, recent efforts often focus on evaluating and improving LLMs' retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one -- or should it? This paper presents findings that the answer may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%--85%) as input length increases, while remaining well within the models' claimed context lengths. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously unrecognized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. They motivate our simple, model-agnostic mitigation strategy that transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement for GPT-4o of up to 4% over an already strong baseline.
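The mitigation is easy to picture in code. Below is a hypothetical sketch of such a recitation prompt; the function name `build_recitation_prompt` and the prompt wording are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of recitation-style prompting: the model is first
# asked to quote the relevant evidence verbatim, effectively turning a
# long-context task into a short-context one before answering.

def build_recitation_prompt(long_context: str, question: str) -> str:
    """Wrap a long-context question so the model recites evidence first."""
    return (
        f"{long_context}\n\n"
        f"Question: {question}\n\n"
        "First, quote verbatim every passage from the text above that is "
        "relevant to the question. Then, using only the quoted passages, "
        "answer the question."
    )

# Usage (model_call is a stand-in for any chat-completion API):
# answer = model_call(build_recitation_prompt(document, "Who signed the treaty?"))
```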
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- North America > Barbados > Saint Michael > Bridgetown (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (5 more...)
- Media > Film (0.95)
- Leisure & Entertainment (0.95)
- Government > Regional Government > North America Government > United States Government (0.68)
Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant
Lv, Qi, Geng, Lei, Cao, Ziqiang, Cao, Min, Li, Sujian, Li, Wenjie, Fu, Guohong
Softmax with the cross entropy loss is the standard configuration for current neural classification models. The gold score for a target class is supposed to be 1, but it is never reachable under the softmax scheme. This keeps the training process running indefinitely and leads to overfitting. Moreover, the "target-approach-1" training goal forces the model to keep learning from all samples, wasting time on samples that have already been classified correctly with high confidence, while the test goal simply requires the target class of each sample to hold the maximum score. To address these weaknesses, we propose the Adaptive Sparse softmax (AS-Softmax), which designs a reasonable, test-matching transformation on top of softmax. For more purposeful learning, we discard the classes whose scores are far smaller than the target class's during training. The model can then focus on learning to distinguish the target class from its strong opponents, which is also the key challenge at test time. In addition, since the training losses of easy samples gradually drop to 0 in AS-Softmax, we develop an adaptive gradient accumulation strategy based on the masked sample ratio to speed up training. We verify AS-Softmax on a variety of text multi-class, text multi-label, text token classification, image classification, and audio classification tasks with class sizes ranging from 5 to 5000+. The results show that AS-Softmax consistently outperforms softmax and its variants, and that the AS-Softmax loss is strongly correlated with classification performance on validation data. Furthermore, the adaptive gradient accumulation strategy yields roughly a 1.2x training speedup compared with standard softmax while maintaining classification effectiveness.
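A minimal PyTorch sketch of the core masking idea follows; the probability-margin keep rule and the `delta` hyperparameter are assumptions standing in for the paper's exact transformation, and the adaptive gradient accumulation is omitted.

```python
import torch
import torch.nn.functional as F

def as_softmax_loss(logits: torch.Tensor, targets: torch.Tensor,
                    delta: float = 0.2) -> torch.Tensor:
    """Sketch of AS-Softmax: drop classes far below the target's score."""
    probs = F.softmax(logits, dim=-1)                       # (B, C)
    target_probs = probs.gather(-1, targets.unsqueeze(-1))  # (B, 1)
    # Keep the target class plus any "strong opponent" whose score is
    # within `delta` (assumed margin) of the target's; discard the rest.
    keep = probs >= (target_probs - delta)
    keep.scatter_(-1, targets.unsqueeze(-1), True)
    # Masked classes get -inf logits, so they vanish from the softmax.
    masked_logits = logits.masked_fill(~keep, float("-inf"))
    # If only the target survives, its probability is 1 and the loss is 0,
    # which is how easy samples drop out of training.
    return F.cross_entropy(masked_logits, targets)
```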
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (21 more...)
Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition
Yang, Zhengdong, Liu, Qianying, Li, Sheng, Cheng, Fei, Chu, Chenhui
We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It uses a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. This addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features to assess token similarity. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
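As a rough illustration of the clustering step, the sketch below groups token embeddings with KMeans to form a two-level hierarchy; `token_embeddings`, the cluster count, and the two-level factorization are assumptions, since the paper's tree construction may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_two_level_hierarchy(token_embeddings: np.ndarray,
                              n_clusters: int) -> np.ndarray:
    """Assign each of the V tokens to one of n_clusters groups, so that
    similar tokens (possibly from different languages) share a cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(token_embeddings)  # shape (V,)

# Example: V=1000 tokens with 64-dim embeddings, grouped into 32 clusters.
emb = np.random.default_rng(0).normal(size=(1000, 64))
clusters = build_two_level_hierarchy(emb, n_clusters=32)

# At decode time the probability would factorize as
#   P(token | h) = P(cluster(token) | h) * P(token | cluster(token), h),
# replacing one V-way softmax with two much smaller ones.
```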
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (9 more...)
- Research Report > New Finding (0.93)
- Research Report > Promising Solution (0.66)
ARMAX identification of low rank graphical models
In large-scale systems, complex internal relationships are often present. Such interconnected systems can be effectively described by low rank stochastic processes. When identifying a predictive model of a low rank process from sampled data, the rank-deficient property of the spectral density is often obscured by the measurement noise that is inevitable in practice. However, existing low rank identification approaches often do not take noise into explicit consideration, leading to non-negligible inaccuracies even under weak noise. In this paper, we address the identification of low rank processes under measurement noise. We find that the noisy measurement model admits a sparse plus low rank structure in latent-variable graphical models. Specifically, we first decompose the problem into a maximum entropy covariance extension problem and a low rank graphical estimation problem based on an autoregressive moving-average with exogenous input (ARMAX) model. To identify the ARMAX low rank graphical models, we propose an estimation approach based on maximum likelihood. The identifiability and consistency of this approach are proven under certain conditions. Simulation results confirm the reliable performance of the entire algorithm in both parameter estimation and noisy-data filtering.
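As a hedged sketch of the setup the abstract describes (the notation below is assumed, not taken from the paper): the observed process is the low rank latent process corrupted by white measurement noise, and the latent-variable graphical model splits the inverse spectrum into sparse and low rank parts.

```latex
% Assumed notation: y(t) observed, x(t) low rank latent process,
% w(t) white measurement noise with variance \sigma^2.
\begin{equation}
  y(t) = x(t) + w(t), \qquad
  \Phi_y(e^{i\theta}) = \Phi_x(e^{i\theta}) + \sigma^2 I ,
\end{equation}
% where \Phi_x is rank deficient. In the latent-variable graphical model,
% the inverse spectrum decomposes into a sparse term S (conditional
% dependencies among observed components) and a low rank term L
% (the effect of the latent variables); sign conventions vary:
\begin{equation}
  \Phi_y^{-1}(e^{i\theta}) = S(e^{i\theta}) - L(e^{i\theta}) .
\end{equation}
```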
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.54)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.49)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Analysis of the BraTS 2023 Intracranial Meningioma Segmentation Challenge
LaBella, Dominic, Baid, Ujjwal, Khanna, Omaditya, McBurney-Lin, Shan, McLean, Ryan, Nedelec, Pierre, Rashid, Arif, Tahon, Nourel Hoda, Altes, Talissa, Bhalerao, Radhika, Dhemesh, Yaseen, Godfrey, Devon, Hilal, Fathi, Floyd, Scott, Janas, Anastasia, Kazerooni, Anahita Fathi, Kirkpatrick, John, Kent, Collin, Kofler, Florian, Leu, Kevin, Maleki, Nazanin, Menze, Bjoern, Pajot, Maxence, Reitman, Zachary J., Rudie, Jeffrey D., Saluja, Rachit, Velichko, Yury, Wang, Chunhao, Warman, Pranav, Adewole, Maruf, Albrecht, Jake, Anazodo, Udunna, Anwar, Syed Muhammad, Bergquist, Timothy, Chen, Sully Francis, Chung, Verena, Conte, Gian-Marco, Dako, Farouk, Eddy, James, Ezhov, Ivan, Khalili, Nastaran, Iglesias, Juan Eugenio, Jiang, Zhifan, Johanson, Elaine, Van Leemput, Koen, Li, Hongwei Bran, Linguraru, Marius George, Liu, Xinyang, Mahtabfar, Aria, Meier, Zeke, Moawad, Ahmed W., Mongan, John, Piraud, Marie, Shinohara, Russell Takeshi, Wiggins, Walter F., Abayazeed, Aly H., Akinola, Rachel, Jakab, András, Bilello, Michel, de Verdier, Maria Correia, Crivellaro, Priscila, Davatzikos, Christos, Farahani, Keyvan, Freymann, John, Hess, Christopher, Huang, Raymond, Lohmann, Philipp, Moassefi, Mana, Pease, Matthew W., Vollmuth, Phillipp, Sollmann, Nico, Diffley, David, Nandolia, Khanak K., Warren, Daniel I., Hussain, Ali, Fehringer, Pascal, Bronstein, Yulia, Deptula, Lisa, Stein, Evan G., Taherzadeh, Mahsa, de Oliveira, Eduardo Portela, Haughey, Aoife, Kontzialis, Marinos, Saba, Luca, Turner, Benjamin, Brüßeler, Melanie M. T., Ansari, Shehbaz, Gkampenis, Athanasios, Weiss, David Maximilian, Mansour, Aya, Shawali, Islam H., Yordanov, Nikolay, Stein, Joel M., Hourani, Roula, Moshebah, Mohammed Yahya, Abouelatta, Ahmed Magdy, Rizvi, Tanvir, Willms, Klara, Martin, Dann C., Okar, Abdullah, D'Anna, Gennaro, Taha, Ahmed, Sharifi, Yasaman, Faghani, Shahriar, Kite, Dominic, Pinho, Marco, Haider, Muhammad Ammar, Aristizabal, Alejandro, Karargyris, Alexandros, Kassem, Hasan, Pati, Sarthak, Sheller, Micah, Alonso-Basanta, Michelle, Villanueva-Meyer, Javier, Rauschecker, Andreas M., Nada, Ayman, Aboian, Mariam, Flanders, Adam E., Wiestler, Benedikt, Bakas, Spyridon, Calabrese, Evan
We describe the design and results of the BraTS 2023 Intracranial Meningioma Segmentation Challenge. The BraTS Meningioma Challenge differed from prior BraTS Glioma challenges in that it focused on meningiomas, which are typically benign extra-axial tumors with diverse radiologic and anatomical presentations and a propensity for multiplicity. Nine participating teams each developed deep-learning automated segmentation models using image data from the largest multi-institutional, systematically expert-annotated, multilabel, multi-sequence meningioma MRI dataset to date, comprising 1000 training cases, 141 validation cases, and 283 hidden test cases. Each case included T2, T2/FLAIR, T1, and T1Gd brain MRI sequences with associated tumor compartment labels delineating enhancing tumor, non-enhancing tumor, and surrounding non-enhancing T2/FLAIR hyperintensity. Participant automated segmentation models were evaluated and ranked with a scoring system based on lesion-wise metrics, including the Dice similarity coefficient (DSC) and the 95% Hausdorff distance. The top-ranked team achieved lesion-wise median DSCs of 0.976, 0.976, and 0.964 for enhancing tumor, tumor core, and whole tumor, respectively, with corresponding average DSCs of 0.899, 0.904, and 0.871. These results serve as state-of-the-art benchmarks for future pre-operative meningioma automated segmentation algorithms. Additionally, we found that 1286 of 1424 cases (90.3%) had at least one compartment voxel abutting the edge of the skull-stripped image, which warrants further investigation into optimal pre-processing and face-anonymization steps.
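For context, the plain Dice similarity coefficient on binary masks can be computed as in the minimal sketch below; the lesion-wise variant used for ranking additionally matches individual predicted and ground-truth lesions before scoring, which this sketch does not implement.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """DSC = 2|P ∩ T| / (|P| + |T|) for binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: conventionally a perfect score
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```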
- North America > United States > California > San Francisco County > San Francisco (0.29)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.15)
- Europe > Switzerland > Zürich > Zürich (0.14)
- (55 more...)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- (3 more...)
Statistical Mechanics and Artificial Neural Networks: Principles, Models, and Applications
Böttcher, Lucas, Wheeler, Gregory
The field of neuroscience and the development of artificial neural networks (ANNs) have mutually influenced each other, drawing from and contributing to many concepts initially developed in statistical mechanics. Notably, Hopfield networks and Boltzmann machines are versions of the Ising model, a model extensively studied in statistical mechanics for over a century. In the first part of this chapter, we provide an overview of the principles, models, and applications of ANNs, highlighting their connections to statistical mechanics and statistical learning theory. Artificial neural networks can be seen as high-dimensional mathematical functions, and understanding the geometric properties of their loss landscapes (i.e., the high-dimensional space on which one wishes to find extrema or saddles) can provide valuable insights into their optimization behavior, generalization abilities, and overall performance. Visualizing these functions can help us design better optimization methods and improve their generalization abilities. Thus, the second part of this chapter focuses on quantifying geometric properties and visualizing loss functions associated with deep ANNs.
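The Hopfield/Ising connection mentioned above can be made concrete with a minimal sketch of the textbook model: binary spins in {-1, +1}, Hebbian weights, an Ising-style energy E(s) = -1/2 s^T W s, and asynchronous updates that never increase it. This is an illustration of the standard model, not code from the chapter.

```python
import numpy as np

def train_hebbian(patterns: np.ndarray) -> np.ndarray:
    """Hebbian rule: W = (1/N) * sum_p x_p x_p^T, with zero diagonal.
    `patterns` has shape (P, N) with entries in {-1, +1}."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def energy(W: np.ndarray, s: np.ndarray) -> float:
    """Ising-style energy; each asynchronous update below cannot raise it."""
    return -0.5 * s @ W @ s

def recall(W: np.ndarray, s: np.ndarray, steps: int = 100) -> np.ndarray:
    """Asynchronous dynamics: update one randomly chosen spin at a time."""
    s = s.copy()
    rng = np.random.default_rng(0)
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s
```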
- North America > United States > New York > Richmond County > New York City (0.14)
- North America > United States > New York > Queens County > New York City (0.14)
- North America > United States > New York > New York County > New York City (0.14)
- (31 more...)
- Overview (0.68)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.92)
Practice with Graph-based ANN Algorithms on Sparse Data: Chi-square Two-tower model, HNSW, Sign Cauchy Projections
Li, Ping, Zhao, Weijie, Wang, Chao, Xia, Qi, Wu, Alice, Peng, Lijun
Sparse data are common. Traditional ``handcrafted'' features are often sparse. Embedding vectors from trained models can also be very sparse, for example, embeddings trained via the ``ReLU'' activation function. In this paper, we report our exploration of efficient search in sparse data with graph-based ANN algorithms (e.g., HNSW, or SONG, the GPU version of HNSW), which are popular in industrial practice, e.g., search and ads (advertising). We experiment with a proprietary ads-targeting application as well as benchmark public datasets. For ads targeting, we train embeddings with the standard ``cosine two-tower'' model and also develop the ``chi-square two-tower'' model. Both models produce (highly) sparse embeddings when integrated with the ``ReLU'' activation function. In EBR (embedding-based retrieval) applications, after the embeddings are trained, the next crucial task is approximate near neighbor (ANN) search for serving. While there are many ANN algorithms to choose from, in this study we focus on graph-based ANN algorithms (e.g., HNSW-type). Sparse embeddings should help improve the efficiency of EBR. One benefit is the reduced memory cost for the embeddings. The other obvious benefit is the reduced computational time for evaluating similarities, because, for graph-based ANN algorithms such as HNSW, computing similarities is often the dominating cost. In addition to leveraging data sparsity for storage and computation, we also integrate ``sign Cauchy random projections'' (SignCRP) to hash vectors to bits, further reducing the memory cost and speeding up the ANN search. In NIPS'13, SignCRP was proposed to hash the chi-square similarity, which is a well-adopted nonlinear kernel in NLP and computer vision. The chi-square two-tower model, SignCRP, and HNSW are therefore now tightly integrated.
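A minimal sketch of sign Cauchy random projections, assuming a shared (d, k) Cauchy projection matrix: each vector is hashed to k sign bits, and the bit-collision rate serves as a proxy for chi-square similarity (the dimensions and data below are illustrative).

```python
import numpy as np

def sign_cauchy_hash(x: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Return k sign bits for vector x given a (d, k) Cauchy matrix R."""
    return (x @ R) >= 0

rng = np.random.default_rng(0)
d, k = 1024, 256
R = rng.standard_cauchy(size=(d, k))   # projection matrix shared by all vectors

# Two (dense here, typically sparse non-negative) vectors to compare.
x, y = rng.random(d), rng.random(d)
bits_x, bits_y = sign_cauchy_hash(x, R), sign_cauchy_hash(y, R)
collision_rate = np.mean(bits_x == bits_y)  # proxy for chi-square similarity
```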
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- (22 more...)
Calibrated Propensity Scores for Causal Effect Estimation
Deshpande, Shachi, Kuleshov, Volodymyr
Propensity scores are commonly used to balance observed covariates while estimating treatment effects. Estimates obtained through propensity score weighting can be biased when the propensity score model cannot learn the true treatment assignment mechanism. We argue that the probabilistic output of a learned propensity score model should be calibrated, i.e., a predicted treatment probability of 90% should correspond to 90% of individuals being assigned to the treatment group. We propose simple recalibration techniques to ensure this property. We investigate the theoretical properties of a calibrated propensity score model and its role in unbiased treatment effect estimation. We demonstrate improved causal effect estimation with calibrated propensity scores on several tasks, including high-dimensional genome-wide association studies, where we also show reduced computational requirements when calibration is applied to simpler propensity score models.
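A hedged sketch of the recalibration idea: fit a propensity model, recalibrate its scores on held-out data (isotonic regression here is one standard recalibration technique; the paper's exact methods may differ), and plug the calibrated scores into an inverse-propensity-weighted estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

def calibrated_propensity_ipw(X, T, Y):
    """ATE via IPW with recalibrated propensity scores (illustrative)."""
    # Fit the propensity model on one split, recalibrate on the other.
    X_fit, X_cal, T_fit, T_cal = train_test_split(
        X, T, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_fit, T_fit)
    iso = IsotonicRegression(out_of_bounds="clip").fit(
        model.predict_proba(X_cal)[:, 1], T_cal)
    # Calibrated scores, clipped away from 0/1 for stable weights.
    e = np.clip(iso.predict(model.predict_proba(X)[:, 1]), 1e-3, 1 - 1e-3)
    # Inverse-propensity-weighted estimate of the average treatment effect.
    return np.mean(T * Y / e - (1 - T) * Y / (1 - e))
```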
- North America > United States > New York > New York County > New York City (0.14)
- North America > Greenland (0.04)
- North America > Barbados > Saint Michael > Bridgetown (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Natural Language (0.67)